Andrew F. Siegel
ABSTRACT


The repeated median algorithm is a robustified U-statistic
in which nested medians replace the single mean. Unlike many
generalizations of the univariate median, repeated median estimates
maintain the high 50% breakdown value and can resist the
effects of outliers even when they comprise nearly half of the
data. Because they are calculated directly, not iteratively,
repeated median procedures can be used as starting values for
iterative robust estimation methods. For bivariate linear regression
with symmetric errors, repeated median estimates are unbiased
and Fisher consistent, and their efficiency under Gaussian sampling
can be comparable to the efficiency of the univariate median.


Key Words: Breakdown Value, U-Statistic, Resistance.


1. INTRODUCTION


Robust regression procedures based on medians have been considered
by Theil (1950), Mood (1950, p. 406), Brown and Mood (1951),
Sen (1968), Maritz (1979), and others. Such high-breakdown procedures
are of interest for several reasons. First, some applied
problems, including the editing of data, require maximal protection
against the presence of outliers. Siegel and Benson (1980)
provide an example of this need in the comparison of shapes.
Secondly, many of the more efficient robust procedures, including
M-estimates (Huber, 1973), are iterative and require directly
computable resistant starting values (Andrews, 1974) to guard
against convergence to a non-robust local optimum near the
least-squares solution. Finally, the extreme case of high-breakdown
estimates should be well understood.

The repeated median algorithm is defined in Section 2 as a
modified U-statistic in which nested medians are used instead of
a single mean, and its computational complexity is found. The
breakdown value is shown in Section 3 to be 50%, the best possible
for unbounded invariant estimators and an improvement upon
previously considered median procedures. Under suitable conditions,
repeated median estimates are unbiased and Fisher consistent, as
shown in Section 4, and their efficiency under Gaussian sampling
can be comparable to the efficiency of the univariate median.


2. THE REPEATED MEDIAN ALGORITHM


We first consider the bivariate linear case of fitting a
robust line $Y = A + BX$ to the data $(X_i, Y_i)$, $i = 1, \ldots, n$, with
distinct $X_i$. Define the pairwise slope $B(i,j) = (Y_j - Y_i)/(X_j - X_i)$
of the line from point i to point j. These n(n-1)/2 slope
estimates will be condensed into a single number using two stages
of medians. The repeated median estimate of slope is


$$\hat{B} = \operatorname*{Median}_{i} \Big\{ \operatorname*{Median}_{j \ne i} \{ B(i,j) \} \Big\} \tag{2.1}$$


The  inner  median  is  the  median  slope  of  the  lines  that  pass 
through  point  i  .  We  can  visualize  (2.1)  as  the  median  of  the 
column  medians  (or  row  medians,  by  symmetry)  of  the  B(i,j)  matrix, 
ignoring  entries  along  the  main  diagonal.  This  is  not  an  iterative 
method; if we calculate (2.1) using the residuals $R_i = Y_i - \hat{B} X_i$ in
place of $Y_i$, we obtain zero by additive invariance of the median.
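
As an illustration, the slope estimate (2.1) can be sketched in a few lines of Python. This is a minimal sketch rather than the author's implementation: the function name `repeated_median_slope` and the small data set are assumptions made for the example, and `statistics.median` averages the two middle values when the count is even.

```python
from statistics import median


def repeated_median_slope(x, y):
    """Median over i of the median over j != i of the pairwise slopes B(i, j)."""
    n = len(x)
    inner_medians = []
    for i in range(n):
        # Slopes of all lines through point i (the diagonal entry is skipped).
        slopes = [(y[j] - y[i]) / (x[j] - x[i]) for j in range(n) if j != i]
        inner_medians.append(median(slopes))
    return median(inner_medians)


# Hypothetical data: five points on Y = 1 + 2X and two gross outliers.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1 + 2 * xi for xi in x]
y[5], y[6] = 100.0, -50.0

b_hat = repeated_median_slope(x, y)

# Recomputing the slope from the residuals R_i = Y_i - b_hat * X_i.
residuals = [yi - b_hat * xi for xi, yi in zip(x, y)]
slope_of_residuals = repeated_median_slope(x, residuals)
```

In this example five of the seven points lie on the line, so every inner median taken at one of those points equals 2, and the slope recomputed from the residuals is zero, which is the sense in which the method is not iterative.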

The y-intercept A can be estimated in two ways. If we use
the value $\hat{B}$ just estimated, a single median will suffice for
this hierarchical approach:


$$\hat{A}^{(1)} = \operatorname*{Median}_{i} \{ Y_i - \hat{B} X_i \} \tag{2.2}$$

Otherwise,  A  can  be  estimated  directly  using  a  double  median 
as  in  (2.1)  to  obtain 

$$\hat{A}^{(2)} = \operatorname*{Median}_{i} \Big\{ \operatorname*{Median}_{j \ne i} \{ A(i,j) \} \Big\} \tag{2.3}$$

where $A(i,j) = (X_j Y_i - X_i Y_j)/(X_j - X_i)$ is the y-intercept of the
line connecting points i and j. Less time is required for
computing the hierarchical estimate (2.2), but direct estimation
(as in 2.3) is invariant to the ordering of the parameters A
and B.
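
The two intercept estimates can be compared side by side in a short Python sketch. The helper names `double_median` and `fit_line` and the data are assumptions introduced only for this illustration; the formulas inside are the pairwise slope, the single median (2.2), and the pairwise intercept A(i,j) used in (2.3).

```python
from statistics import median


def double_median(pair_value, n):
    """Median over i of the median over j != i of pair_value(i, j)."""
    inner = []
    for i in range(n):
        inner.append(median([pair_value(i, j) for j in range(n) if j != i]))
    return median(inner)


def fit_line(x, y):
    """Return (slope, hierarchical intercept, direct intercept)."""
    n = len(x)
    # Repeated median slope, as in (2.1).
    b_hat = double_median(lambda i, j: (y[j] - y[i]) / (x[j] - x[i]), n)
    # Hierarchical intercept: one median of Y_i - b_hat * X_i, as in (2.2).
    a_hier = median([yi - b_hat * xi for xi, yi in zip(x, y)])
    # Direct intercept: double median of the pairwise intercepts, as in (2.3).
    a_direct = double_median(
        lambda i, j: (x[j] * y[i] - x[i] * y[j]) / (x[j] - x[i]), n
    )
    return b_hat, a_hier, a_direct


# Hypothetical data: five points on Y = 1 + 2X and two gross outliers.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1 + 2 * xi for xi in x]
y[5], y[6] = 100.0, -50.0

b_hat, a_hier, a_direct = fit_line(x, y)
```

The hierarchical route needs only one extra pass over the n points once the slope is known, whereas the direct route repeats the full double median over all pairs.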

The general repeated median algorithm is like a U-statistic
(Hoeffding, 1948), except that nested medians replace the overall
mean. We therefore obtain a general procedure for estimating
a real parameter $\theta$ whenever there is a positive integer k such
that every subset of k data points determines a value of $\theta$;
say points numbered $i_1, \ldots, i_k$ determine $\theta(i_1, \ldots, i_k)$. The
mean of these estimates, if we have n data points in all, is
the U-statistic

$$\binom{n}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le n} \theta(i_1, \ldots, i_k). \tag{2.4}$$


Using a median in place of the mean, we can robustify this somewhat
to

$$\operatorname*{Median}_{1 \le i_1 < \cdots < i_k \le n} \{ \theta(i_1, \ldots, i_k) \} \tag{2.5}$$


which includes the case of regression estimates considered by
Theil (1950) and Sen (1968).

Repeated median estimates use a succession of k partial
medians. Begin by reducing the number of indices from k to
k-1:

$$\theta(i_1, \ldots, i_{k-1}) = \operatorname*{Median}_{i_k \notin \{i_1, \ldots, i_{k-1}\}} \{ \theta(i_1, \ldots, i_{k-1}, i_k) \} \tag{2.6}$$


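This process can be repeated, and with each median an index is
deleted. Finally, the repeated median estimate is

$$\hat{\theta} = \operatorname*{Median}_{i_1} \Big\{ \operatorname*{Median}_{i_2 \notin \{i_1\}} \Big\{ \cdots \operatorname*{Median}_{i_k \notin \{i_1, \ldots, i_{k-1}\}} \{ \theta(i_1, \ldots, i_k) \} \cdots \Big\} \Big\} \tag{2.7}$$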


For example, in the multiple regression model

$$Y = A + B_1 X_1 + B_2 X_2 \tag{2.8}$$

$B_1$ would be estimated using a triple median

$$\hat{B}_1 = \operatorname*{Median}_{i} \Big\{ \operatorname*{Median}_{j \ne i} \Big\{ \operatorname*{Median}_{k \ne i,j} \{ B_1(i,j,k) \} \Big\} \Big\} \tag{2.9}$$

where $B_1(i,j,k)$ is the $B_1$ coefficient of the plane (2.8) determined
by points i, j, and k. Collinearity problems can be handled
by considering only those triples that actually determine a value
for $B_1$.

When more than one parameter is to be estimated, they can be
estimated hierarchically using information on previously estimated
parameters at each stage or directly using (2.7) for each parameter.
These two approaches were illustrated in (2.2) and (2.3),
and the same considerations apply in general.
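
The general nested-median construction can be written recursively. The following Python sketch is one possible reading of (2.6) and (2.7), not a reference implementation: the function `repeated_median`, the convention that the primitive estimate returns `None` when a subset does not determine a value, and the example data are all assumptions made for illustration.

```python
from statistics import median


def repeated_median(theta, n, k, chosen=()):
    """Nest k medians of theta(i_1, ..., i_k) over distinct indices in range(n).

    theta may return None for subsets that do not determine a value
    (for example collinear points); such subsets are skipped.
    """
    if len(chosen) == k:
        return theta(*chosen)
    values = []
    for i in range(n):
        if i not in chosen:
            value = repeated_median(theta, n, k, chosen + (i,))
            if value is not None:
                values.append(value)
    return median(values) if values else None


# Hypothetical data: six points on the plane Y = 1 + 2*X1 - 3*X2, two outliers.
x1 = [0, 1, 2, 3, 4, 5, 6, 7]
x2 = [v * v for v in x1]
y = [1 + 2 * u - 3 * v for u, v in zip(x1, x2)]
y[6], y[7] = 1000, -1000


def b1(i, j, k):
    """X1 coefficient of the plane through points i, j, k (Cramer's rule)."""
    du1, dv1, dy1 = x1[j] - x1[i], x2[j] - x2[i], y[j] - y[i]
    du2, dv2, dy2 = x1[k] - x1[i], x2[k] - x2[i], y[k] - y[i]
    det = du1 * dv2 - du2 * dv1
    if det == 0:
        return None  # the three design points are collinear
    return (dy1 * dv2 - dy2 * dv1) / det


b1_hat = repeated_median(b1, n=len(y), k=3)
```

With k = 2 and the pairwise slope as the primitive estimate, the same function reduces to the double median of (2.1).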

The computational complexity of (2.7) is $O(n^k)$ because
the total number of medians of n-1 or fewer numbers that must
be performed is

$$1 + \sum_{i=1}^{k-1} \prod_{j=1}^{i} (n - j + 1)$$

and an O(n) algorithm is available for calculating the median
(Knuth, Vol. III, 1973, Section 5.3.3, p. 216).
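
The count of medians can be checked numerically. The sketch below, with hypothetical function names, evaluates one plus the sum over levels i = 1, ..., k-1 of the product n(n-1)...(n-i+1), and compares it with a direct recursive count of the nested medians; it assumes Python 3.8 or later for `math.prod`.

```python
from math import prod


def number_of_medians(n, k):
    """1 + sum over i = 1..k-1 of n (n-1) ... (n-i+1)."""
    return 1 + sum(
        prod(n - j + 1 for j in range(1, i + 1)) for i in range(1, k)
    )


def count_by_recursion(n, k, depth=0):
    """One median at this level plus those nested under each admissible index."""
    if depth == k - 1:
        return 1  # innermost median: nothing nested beneath it
    return 1 + (n - depth) * count_by_recursion(n, k, depth + 1)


counts = {k: number_of_medians(10, k) for k in (1, 2, 3)}
```

The dominant term is of order n^(k-1), and each median costs O(n) with a linear-time selection algorithm, which is consistent with an overall cost of order n^k.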


3.  BREAKDOWN  VALUE 


Breakdown  value  is  a  measure  of  the  ability  of  an  estimator 
to  resist  the  effects  of  outliers  (Hodges,  1967,  and  Hampel,  1971). 
It  is,  roughly  speaking,  the  largest  fraction  of  the  data  that 
can  be  arbitrarily  changed  while  the  estimator  is  guaranteed  to 
remain  bounded.  The  arithmetic  mean  has  a  breakdown  value  of  0%, 
while the univariate median achieves nearly 50% because $[(n-1)/2]$
out of n points can be changed while the median remains bounded
(brackets indicate the greatest integer function). This value,
50%, is the highest possible for invariant unbounded estimators.
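
A small numerical illustration of these breakdown values, using made-up data and a hypothetical helper `corrupt`, is sketched below: one changed point is enough to carry the mean away, while the median tolerates [(n-1)/2] changed points but not one more.

```python
from statistics import mean, median


def corrupt(data, m, value):
    """Return a copy of data with its last m entries replaced by value."""
    return data[: len(data) - m] + [value] * m


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(data)
m = (n - 1) // 2  # greatest integer in (n-1)/2

mean_one_bad = mean(corrupt(data, 1, 1e9))        # carried off by one point
median_m_bad = median(corrupt(data, m, 1e9))      # still among the original values
median_too_many = median(corrupt(data, m + 1, 1e9))  # now equals the moved value
```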

Median-based  regression  methods  do  not  necessarily  preserve 
the  highest  possible  50%  breakdown  value  of  the  univariate  median. 
For  example,  least  absolute  error  regression  (Bassett  and  Koenker, 
1978)  has  a  breakdown  value  of  zero  (0%);  the  figure  shows  an 
example  in  which  the  least  absolute  error  regression  line  can  be 
controlled  by  changing  only  the  height  of  a  single  point. 

The Mood-Brown procedure for bivariate linear regression
(Mood, 1950; and Brown and Mood, 1951) requires that the median
residual be zero for both halves (low X and high X) of the data.
Because half of the data in either group can control the estimated
line, the breakdown value is 25%. The breakdown value of Andrews'
median-based regression method is also at most 25% (Andrews, 1974,
Section 5).


FIGURE 1. The height of a single influential point can
control the least absolute error regression line.

The overall median procedure (2.5), studied by Theil (1950)
and Sen (1968) for bivariate linear regression, has a breakdown
value of 29%. In higher dimensions, with subsets of k points
at a time, the breakdown value is $1 - 2^{-1/k}$. This is found by
setting the ratio of the number of unchanged to total estimates
$\theta(i_1, \ldots, i_k)$ equal to 1/2, the breakdown value for the median.
When the primitive estimates $\theta(i_1, \ldots, i_k)$ are themselves
robust, the resulting breakdown value can be higher.
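
The calculation behind this value can be written out as follows. This is a sketch of the large-sample argument only, with $\varepsilon$ introduced here as a symbol for the fraction of points that are changed:

```latex
% Fraction of k-subsets containing no changed point, when a fraction
% \varepsilon of the n points is changed:
\frac{\binom{(1-\varepsilon)n}{k}}{\binom{n}{k}}
  \;\longrightarrow\; (1-\varepsilon)^{k}
  \qquad (n \to \infty,\ k \text{ fixed}).
% Setting this ratio equal to 1/2 and solving for \varepsilon:
(1-\varepsilon)^{k} = \tfrac{1}{2}
  \quad\Longrightarrow\quad
  \varepsilon = 1 - 2^{-1/k},
\qquad\text{e.g. } k = 2:\ \varepsilon = 1 - 1/\sqrt{2} \approx 0.293 .
```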

The repeated median procedure has an asymptotic breakdown
value of 50% (as $n \to \infty$ with k fixed) because each nested median
in (2.7) involves n or fewer terms (the overall, nonrepeated
median (2.5) involves $\binom{n}{k}$ terms at once in a single median).
This is shown in the following theorem, which finds the exact
breakdown value in small samples:

Theorem. The repeated median estimate (2.7) will remain
bounded whenever more than (n+k-1)/2 points are held fixed while
the remaining points are arbitrarily moved, provided each subset
of k of the fixed points determines a value $\theta(i_1, \ldots, i_k)$.
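
The contrast between the overall median (2.5) and the repeated median (2.1) at this threshold can be illustrated with a small Python sketch. The data, the size of the moved values, and the function names are assumptions chosen for the example; with n = 11 and k = 2 the theorem asks for more than 6 fixed points, so 7 points are held on a line and 4 are moved.

```python
from itertools import combinations
from statistics import median


def slope(x, y, i, j):
    return (y[j] - y[i]) / (x[j] - x[i])


def overall_median_slope(x, y):
    """Single median over all pairs, as in (2.5) with k = 2."""
    pairs = combinations(range(len(x)), 2)
    return median([slope(x, y, i, j) for i, j in pairs])


def repeated_median_slope(x, y):
    """Double median, as in (2.1)."""
    n = len(x)
    return median(
        [median([slope(x, y, i, j) for j in range(n) if j != i])
         for i in range(n)]
    )


n, k = 11, 2
min_fixed = (n + k - 1) // 2 + 1  # smallest count exceeding (n+k-1)/2

x = list(range(n))
y = [1 + 2 * xi for xi in x]      # fixed points lie on Y = 1 + 2X
for i in range(min_fixed, n):     # move the remaining points far away
    y[i] = 1e6 * x[i]

overall = overall_median_slope(x, y)
repeated = repeated_median_slope(x, y)
```

Only 21 of the 55 pairwise slopes come from two fixed points, so the single median is carried off with the moved points, while each inner median taken at a fixed point still sees a majority of undisturbed slopes.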

This  theorem  is  a  consequence  of  a  more  general  lemma. 

Lemma. Consider a class of functions $\theta_\alpha(i_1, \ldots, i_k)$ where
$1 \le i_j \le n$ are integers and different values of $\alpha$ can be thought
of as different data configurations. Suppose $A \subset \{1, \ldots, n\}$ has
more than (n+k-1)/2 elements and $\theta_\alpha(i_1, \ldots, i_k)$ are bounded
(as $\alpha$ varies) whenever $i_1, \ldots, i_k \in A$. Then the repeated median
values $\hat{\theta}_\alpha$ calculated from (2.7) are also bounded.


Proof. We proceed by induction on k. When k=1, this
reduces to the breakdown bound of the univariate median. Now
assume the hypotheses of the lemma. Performing the innermost
median (2.6) in (2.7) we see that

$$\theta_\alpha(i_1, \ldots, i_{k-1}, \cdot) = \operatorname*{Median}_{i_k \notin \{i_1, \ldots, i_{k-1}\}} \{ \theta_\alpha(i_1, \ldots, i_k) \}$$

are bounded whenever $i_1, \ldots, i_{k-1} \in A$ because the median has n-k+1
terms, of which more than half are bounded (A contains more than
(n+k-1)/2 - (k-1) = (n-k+1)/2 admissible values of $i_k$). Note that for each
$\alpha$, the k-fold repeated median of $\theta_\alpha(i_1, \ldots, i_k)$ is identical to
the (k-1)-fold repeated median of $\theta_\alpha(i_1, \ldots, i_{k-1}, \cdot)$. These
are seen to be bounded by using the induction hypothesis. □


4. UNBIASEDNESS, FISHER CONSISTENCY, AND EFFICIENCY

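The repeated median estimates are unbiased in the bivariate
linear model

$$Y_i = A + B X_i + e_i, \quad i = 1, \ldots, n \tag{4.1}$$

with fixed $X_i$ and symmetric errors for which $(e_1, \ldots, e_n) \overset{D}{=}
(-e_1, \ldots, -e_n)$. The slope estimate $\hat{B}$ from (2.1) is symmetrically
distributed about the true slope B because

$$\hat{B} - B = \operatorname*{Median}_{i} \Big\{ \operatorname*{Median}_{j \ne i} \Big\{ \frac{e_j - e_i}{X_j - X_i} \Big\} \Big\} \overset{D}{=} -(\hat{B} - B). \tag{4.2}$$

Thus $E(\hat{B}) = B$ whenever the expectation exists. We find similarly
that $\hat{A}$ is symmetrically distributed about A for both the
single median (2.2) and the double median (2.3) calculation.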

Repeated median estimates are Fisher consistent for bivariate
distributions in which Y given X is symmetrically distributed
about a center that is linear in X, so that $(X, Y - A - BX) \overset{D}{=}
(X, -(Y - A - BX))$. Fisher consistency requires that when we evaluate
the estimator at the actual population distribution (not at a
sample), we obtain the population parameter (Cox and Hinkley,
1974, p. 287). The repeated median procedure (2.7) extends to
allow us to estimate the slope B given a distribution $(X, Y) \sim F$.
Assume the X marginal is continuous and define

$$\hat{B} = \operatorname*{Median}_{(X,Y) \sim F} \Big\{ \operatorname*{Median}_{(X',Y') \sim F} \Big\{ \frac{Y' - Y}{X' - X} \Big\} \Big\} \tag{4.3}$$


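This is algebraically equivalent to

$$\begin{aligned}
\hat{B} - B &= \operatorname*{Median}_{(X,Y) \sim F} \Big\{ \operatorname*{Median}_{(X',Y') \sim F} \Big\{ \frac{(Y' - A - BX') - (Y - A - BX)}{X' - X} \Big\} \Big\} \\
&= -\operatorname*{Median}_{(X,Y) \sim F} \Big\{ \operatorname*{Median}_{(X',Y') \sim F} \Big\{ -\frac{(Y' - A - BX') - (Y - A - BX)}{X' - X} \Big\} \Big\} \\
&= -(\hat{B} - B)
\end{aligned} \tag{4.4}$$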


where the last equality follows by symmetry. Because these are
fixed, not random, variables, (4.4) must be zero and we have
$\hat{B} = B$. Similarly, it can be shown that $\hat{A} = A$ regardless of whether
$\hat{A}$ is found using a single or double median.

The efficiency of repeated median regression, in the presence
of Gaussian errors, is not far from the efficiency of the univariate
median, as shown in the table for evenly spaced and for
Gaussian X values. Efficiency here is the ratio of the variances
of the least squares and median-based estimates. For the
univariate median, this ratio is asymptotically $2/\pi = .64$
(Cramér, 1946, p. 369).

Efficiencies for repeated median regression were estimated
using Monte Carlo computer simulation techniques. For each table
entry, 10,000 replications were performed in order to achieve an
estimated standard error of the efficiency smaller than .01.
Simulations were done on Princeton University's IBM 3033 computer
using the IMSL subroutine ggnpm for pseudorandom Gaussian deviates.
Three X designs were chosen: evenly spaced, even Gaussian
percentiles ($\Phi^{-1}((i - \tfrac{1}{2})/n)$, $i = 1, \ldots, n$, where $\Phi$ denotes the
standard Gaussian cumulative distribution function), and random
Gaussian deviates chosen independently for each replication.
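
A scaled-down version of such a simulation can be sketched in Python for the evenly spaced design. This is not the original program: it uses Python's `random` module in place of the IMSL generator, a hypothetical function `simulated_efficiency`, and only 2,000 replications, so the result carries a larger standard error than the table entries.

```python
import random
from statistics import median, variance


def repeated_median_slope(x, y):
    """Double median of pairwise slopes, as in (2.1)."""
    n = len(x)
    return median(
        [median([(y[j] - y[i]) / (x[j] - x[i]) for j in range(n) if j != i])
         for i in range(n)]
    )


def least_squares_slope(x, y):
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx


def simulated_efficiency(n, replications, seed=0):
    """Variance of the least squares slope divided by that of the repeated median slope."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))  # evenly spaced design
    ls, rm = [], []
    for _ in range(replications):
        # True line is Y = 0 with independent standard Gaussian errors.
        y = [rng.gauss(0.0, 1.0) for _ in x]
        ls.append(least_squares_slope(x, y))
        rm.append(repeated_median_slope(x, y))
    return variance(ls) / variance(rm)


efficiency = simulated_efficiency(n=10, replications=2000)
```

The value obtained can be set beside the evenly spaced entry of the table for n = 10, bearing in mind the smaller number of replications and the different random number generator.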


TABLE 1. Efficiency of repeated median regression bivariate slope
estimation with independent Gaussian errors, by Monte Carlo simulation.

                              X design
                                       Gaussian
   n     evenly spaced     even percentiles     random

  10          .69                 .64             .53
  20          .73                 .66             .61